Experiments in Data Compression V
(or "It seemed like a good idea at the time")

by John Kormylo
E.L.F. Software


State of the Art

Since Report IV, I have fixed a couple of bugs in the "Smart" algorithm.
The current comparisons are as follows:

           | DEMONSTR.DOC | ELFBACK.PRG | WORDLIST.TXT |
-----------+--------------+-------------+--------------+
 ZIP       |        25134 |       32417 |       496876 |
 Best 16K  |        21960 |       32372 |       446606 |
 Smart 16K |        22076 |       32359 |       489471 |
 Smart 8K  |        22214 |       32190 |       483486 |
-----------+--------------+-------------+--------------+
 Adapt 8K  |        22317 |       32057 |       484332 |
-----------+--------------+-------------+--------------+

The "16K" and "8K" refer to the number of entries in the main dictionary.
Note that the "Smart" method now outperforms the "Best" method on one of
the files.

Unfortunately, the results were not so good when archiving ELFARC v1.4,
as shown by the following table:

           | ELFARC.PRG | ELFARC.RSC | READ_ME   |
-----------+------------+------------+-----------+
 ZIP       |      32988 |       6255 |      1000 |
 Best 16K  |      33143 |       6314 |      1037 |
 Smart 8K  |      33107 |       6305 |      1028 |
-----------+------------+------------+-----------+
 Adapt 8K  |      32978 |       6192 |      1043 |
-----------+------------+------------+-----------+


Emulating the Competition

A number of tests were made to make LZA more like ZIP.  Most changes
(such as increasing the maximum match length to 256 bytes or adding every
possible string to the dictionary) resulted in a loss of performance.

One feature of LZA which has never really been tested for effectiveness
is the "new character" code.  Instead of assigning an initial count of 1
to each of the 320 possible codes (256 ASCII and 64 length), LZA uses a
"new character" code which is followed by the character to be added.  The
reason for this is to improve performance on text files, where only about
96 ASCII characters are ever used.  However, it also imposes an overhead
which is never recovered for binary files or short text files, which are
primarily what archive programs are used for.

Removing the "new character" code is a fundamental change in the LZA
format.  Consequently, a number of other fundamental changes which had
not been implemented previously simply because they were fundamental
(such as sorting the codes by frequency) were also implemented.

First, the code set was reduced to 258 (256 ASCII and one each for the
FIFO buffer and main dictionary).  The length code was implemented
separately, also fully adaptive.  This was done so that the ASCII codes
would not be penalized by counts associated with all the length codes
(using 256 length codes instead of 64 would have added 1 bit to every
character, at least initially).  The FIFO lag and dictionary range codes
remained as before.

The results for this new method are shown in the above tables as
"Adapt 8K."  As expected, it performed slightly worse on text files, but
better on binary files.  More importantly, it beat ZIP on every file but
one.


FIFO Buffer - Layered Binary Search Tree

While using a hash table for the FIFO buffer is simpler and probably
faster for searches, a layered binary search tree takes much less RAM and
adds and removes entries much faster.  The resulting tree will use less
than 2K RAM, as opposed to 64K RAM for the hash table (for a maximum
match length of 65 bytes).  The speed advantage is best illustrated by
noting that the entire FIFO buffer is replaced every 64 entries.
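Before getting into the tree itself, here is a rough sketch of that
turnover (the names fifo_update, fifo_add, fifo_remove and the circular
index are mine for illustration, not the actual LZA code): each new
string simply reuses the slot of the oldest one.

    #define FIFO_SIZE 64                        /* entries in the FIFO buffer */

    static unsigned char *fifo_text[FIFO_SIZE]; /* string pointers, one per slot */
    static int fifo_head = 0;                   /* slot holding the oldest string */

    extern void fifo_remove(int index);  /* delete a slot's string from the tree */
    extern void fifo_add(int index);     /* insert a slot's string into the tree */

    /* Slide the FIFO window forward by one string. */
    void fifo_update(unsigned char *newtext)
    {
        fifo_remove(fifo_head);          /* oldest entry goes out of the tree */
        fifo_text[fifo_head] = newtext;  /* reuse the slot for the newest string */
        fifo_add(fifo_head);
        fifo_head = (fifo_head + 1) % FIFO_SIZE;
    }

Every 64 such updates the table contents have turned over completely, so
adding and removing entries quickly matters as much as searching quickly.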
The search tree used for the FIFO table differs from that used in the
main dictionary in that it includes an index into the table of string
pointers.

    typedef struct fifo {
        struct fifo *child1;    /* less than */
        struct fifo *child2;    /* greater than */
        struct fifo *next;      /* start of next tree */
        unsigned char *text;    /* string pointer */
        unsigned int sum;       /* sum of all children */
        unsigned char len;      /* number of common chars */
        unsigned char test;     /* first uncommon char */
        int index;              /* fifo table index */
    } FIFO;

One could drop the text pointer and use the text pointer array index
instead, but it is faster to keep it.

This search tree also differs from the main dictionary in how it adds and
removes entries.  The main dictionary used a fixed array of 8K (or 16K)
entries, with the duplicate nodes coming from the free structure list.
When removing a node, one must also search the remaining layers for
duplicates.  The FIFO buffer allocates all of its nodes from the free
structure list.  Whenever a new string is added, the text and index
values for its matches within each layer are set to this (latest) entry,
so that one will always find the most recent match.  This also means that
when it is time to remove an entry, there will be no duplicates remaining
in the tree.  There was about a 15% improvement in speed from this
change.

One could use a common tree structure for both the FIFO buffer and main
dictionary by adding the index to the main dictionary and replacing or
augmenting the text entry with an array of string pointers (referred to
by index).  While this would simplify the code, allowing common functions
for adding and removing entries, it would be slower and use more RAM.


Tweaking the Code

Using Pure Profiler, I was able to improve the speed of some of the code.
One thing I noticed is that Pure C compiled

    while (p) {
        if (p->test == ref)
            break;
        else if (p->test > ref)
            p = p->child1;
        else
            p = p->child2;
    }

using two compares instead of one compare and two branches.  Since over
10% of the CPU time is spent on this type of operation (namely binary
searches), I tried replacing this loop with an assembly language
function.  However, the overhead for calling the function roughly
equalled the advantage of eliminating the duplicate compare.
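For what it's worth, a portable alternative (just a sketch, not something
tested in LZA itself; the temporary d and the int casts are my own) would
be to compute the difference once and branch on its sign, which at least
gives the compiler a single subtraction to reuse for both tests:

    while (p) {
        int d = (int)p->test - (int)ref;  /* one subtraction feeds both tests */
        if (d == 0)
            break;                        /* exact match found */
        p = (d > 0) ? p->child1 : p->child2;
    }

Whether Pure C actually turns this into one compare and two branches
would of course have to be checked with the profiler.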